# Video Understanding
## Test With Sdfvd
A video understanding model fine-tuned from MCG-NJU/videomae-base, with average performance on its evaluation set (50% accuracy). A minimal inference sketch follows below.
- Task: Video Processing · Tags: Transformers
- Author: cocovani · Downloads: 16 · Likes: 0

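Several entries in this list are VideoMAE fine-tunes and load the same way. A minimal inference sketch, assuming the public MCG-NJU/videomae-base-finetuned-kinetics checkpoint as a stand-in and a random 16-frame clip in place of real video frames:

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEForVideoClassification

# Checkpoint id is a stand-in: substitute the fine-tuned repo you want to test.
ckpt = "MCG-NJU/videomae-base-finetuned-kinetics"
processor = VideoMAEImageProcessor.from_pretrained(ckpt)
model = VideoMAEForVideoClassification.from_pretrained(ckpt)

# VideoMAE consumes 16-frame RGB clips; random data stands in for frames
# sampled from a real video.
video = list(np.random.randint(0, 256, (16, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
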
## Internvl3 1B Hf
InternVL3 is an advanced series of multimodal large language models, demonstrating exceptional multimodal perception and reasoning capabilities and supporting image, video, and text inputs. A minimal usage sketch follows below.
- License: Other · Task: Image-to-Text · Tags: Transformers, Other
- Author: OpenGVLab · Downloads: 1,844 · Likes: 2

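For the Transformers-native InternVL3 packaging, recent transformers releases expose an image-text-to-text pipeline. A minimal sketch, assuming the OpenGVLab/InternVL3-1B-hf repo id and a sample image URL from the Hugging Face documentation assets:

```python
from transformers import pipeline

# Checkpoint id assumed from the card above; requires a recent transformers
# release that ships the image-text-to-text pipeline.
pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")

messages = [{
    "role": "user",
    "content": [
        {"type": "image",
         "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"])
```
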
## Datatrain Videomae Base Finetuned Lr1e 07 Poly3
A video understanding model fine-tuned from MCG-NJU/videomae-base, trained on an unknown dataset with an accuracy of 11.1%.
- Task: Video Processing · Tags: Transformers
- Author: EloiseInacio · Downloads: 13 · Likes: 0

## Videomae Base Finetuned 1e 08 Bs4 Ep2
A video understanding model fine-tuned from MCG-NJU/videomae-base, trained on an unknown dataset. A sketch of the training recipe implied by the name follows below.
- Task: Video Processing · Tags: Transformers
- Author: EloiseInacio · Downloads: 14 · Likes: 0

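The two EloiseInacio fine-tune names above appear to encode their training recipes (learning rate 1e-07 with a degree-3 polynomial schedule; learning rate 1e-08, batch size 4, 2 epochs). Below is a sketch of how such a run is typically set up with the Trainer API; the dummy dataset and binary label set are placeholders, since neither card names its data:

```python
import torch
from torch.utils.data import Dataset
from transformers import (Trainer, TrainingArguments,
                          VideoMAEForVideoClassification)

class RandomClips(Dataset):
    """Placeholder dataset: random 16-frame clips with binary labels."""
    def __len__(self):
        return 8
    def __getitem__(self, i):
        return {"pixel_values": torch.randn(16, 3, 224, 224),
                "labels": torch.tensor(i % 2)}

# num_labels=2 is an assumption; neither card names its label set.
model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base", num_labels=2)

# Hyperparameters mirror the "1e 08 Bs4 Ep2" naming of the card above;
# the sibling card would instead use learning_rate=1e-7 with
# lr_scheduler_type="polynomial".
args = TrainingArguments(
    output_dir="videomae-base-finetuned-1e-08-bs4-ep2",
    learning_rate=1e-8,
    per_device_train_batch_size=4,
    num_train_epochs=2,
    report_to="none",
)

Trainer(model=model, args=args, train_dataset=RandomClips()).train()
```
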
## Qwen2.5 Omni 7B GPTQ 4bit
A 4-bit GPTQ-quantized version of the Qwen2.5-Omni-7B model, supporting multilingual and multimodal tasks.
- License: MIT · Task: Multimodal Fusion · Tags: Safetensors, Multilingual
- Author: FunAGI · Downloads: 3,957 · Likes: 51

## Slowfast Video Mllm Qwen2 7b Convnext 576 Frame96 S1t6
A video multimodal LLM that adopts a slow-fast architecture to balance temporal resolution against spatial detail, sidestepping the sequence-length limits of conventional large language models; per its name, this variant processes 96-frame clips.
- Task: Video-to-Text · Tags: Transformers
- Author: shi-labs · Downloads: 81 · Likes: 0

## Videollama2.1 7B AV CoT
VideoLLaMA2.1-7B-AV is a multimodal large language model focused on audio-visual question answering, processing video and audio inputs jointly to produce high-quality answers and descriptions.
- License: Apache-2.0 · Task: Video-to-Text · Tags: Transformers, English
- Author: lym0302 · Downloads: 34 · Likes: 0

## Videomind 2B
VideoMind is a multimodal agent framework that enhances video reasoning by simulating human thought processes such as task decomposition, moment localization and verification, and answer synthesis.
- License: BSD-3-Clause · Task: Video-to-Text
- Author: yeliudev · Downloads: 207 · Likes: 1

## Slowfast Video Mllm Qwen2 7b Convnext 576 Frame64 S1t4
A video multimodal large language model using a slow-fast architecture to balance temporal resolution and spatial detail, supporting 64-frame video understanding.
- Task: Video-to-Text · Tags: Transformers
- Author: shi-labs · Downloads: 184 · Likes: 0

## Tinyllava Video Qwen2.5 3B Group 16 512
TinyLLaVA-Video is a video understanding model based on Qwen2.5-3B and siglip-so400m-patch14-384, using a grouped resampler to process video frames.
- License: Apache-2.0 · Task: Video-to-Text
- Author: Zhang199 · Downloads: 76 · Likes: 0

## Internvl 2 5 HiCo R16
InternVideo2.5 is a video multimodal large language model (MLLM) enhanced by long and rich context (LRC) modeling, built upon InternVL2.5.
- License: Apache-2.0 · Task: Text-to-Video · Tags: Transformers, English
- Author: FriendliAI · Downloads: 129 · Likes: 1

## Llava NeXT Video 7B Hf
LLaVA-NeXT-Video-7B-hf is a video-based multimodal model that processes video and text inputs to generate text outputs. A minimal generation sketch follows below.
- Task: Video-to-Text · Tags: English
- Author: FriendliAI · Downloads: 30 · Likes: 0

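The "-hf" suffix marks weights repackaged for native Transformers support. A minimal generation sketch, assuming the widely mirrored llava-hf/LLaVA-NeXT-Video-7B-hf repo id and random frames in place of a sampled clip:

```python
import numpy as np
import torch
from transformers import (LlavaNextVideoForConditionalGeneration,
                          LlavaNextVideoProcessor)

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"  # assumed llava-hf packaging
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

conversation = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "What is happening in this clip?"},
        {"type": "video"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Random frames stand in for a clip sampled from a real video.
clip = np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8)

inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
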
## Videomae Base Finetuned Signlanguage Last 3
A video understanding model fine-tuned from MCG-NJU/videomae-base, specialized for sign language recognition.
- Task: Video Processing · Tags: Transformers
- Author: ihsanahakiim · Downloads: 21 · Likes: 1

## Internvl2 5 4B AWQ
InternVL2_5-4B-AWQ is an AWQ-quantized version of InternVL2_5-4B produced with AutoAWQ, supporting multilingual and multimodal tasks.
- License: MIT · Task: Image-to-Text · Tags: Transformers, Other
- Author: rootonchair · Downloads: 29 · Likes: 2

## Magma 8B
Magma is a foundation model for multimodal AI agents: it processes image and text inputs to generate text outputs, with complex interaction abilities in both virtual and real-world environments.
- License: MIT · Task: Image-to-Text · Tags: Transformers
- Author: microsoft · Downloads: 4,526 · Likes: 363

## Smolvlm2 500M Video Instruct
A lightweight multimodal model designed for analyzing video content, processing video, image, and text inputs to generate text outputs. A minimal usage sketch follows below.
- License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, English
- Author: HuggingFaceTB · Downloads: 17.89k · Likes: 56

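SmolVLM2 checkpoints document a chat-template path that accepts video files directly. A sketch following that pattern; the repo id and the local clip.mp4 path are assumptions, and video inputs require a recent transformers release:

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto")

# "clip.mp4" is a placeholder path to a local video file.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "path": "clip.mp4"},
        {"type": "text", "text": "Summarize this video."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True,
    tokenize=True, return_dict=True, return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
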
## Fluxi AI Small Vision
Fluxi AI is a multimodal intelligent assistant based on Qwen2-VL-7B-Instruct that handles text, images, and video, with particular optimization for Portuguese.
- License: Apache-2.0 · Task: Image-to-Text · Tags: Transformers, Other
- Author: JJhooww · Downloads: 25 · Likes: 2

## Internlm Xcomposer2d5 7b Chat
InternLM-XComposer2.5-Chat is a dialogue model trained on top of InternLM-XComposer2.5-7B, showing significant improvements in multimodal instruction following and open-ended dialogue.
- License: Other · Task: Text-to-Image · Tags: PyTorch
- Author: internlm · Downloads: 87 · Likes: 5

## Eagle2 2B
Eagle2 is a family of high-performance vision-language models from NVIDIA that focuses on improving open-source VLMs through data strategy and training recipes. Eagle2-2B is the lightweight member of the series, delivering strong efficiency and speed while maintaining robust performance.
- Task: Text-to-Image · Tags: Transformers, Other
- Author: nvidia · Downloads: 667 · Likes: 21

## Eagle2 9B
Eagle2-9B is the latest vision-language model (VLM) released by NVIDIA, striking a strong balance between performance and inference speed. It combines the Qwen2.5-7B-Instruct language model with a SigLIP+ConvNeXt vision stack and supports multilingual and multimodal tasks.
- Task: Image-to-Text · Tags: Transformers, Other
- Author: nvidia · Downloads: 944 · Likes: 52

## Llava Mini Llama 3.1 8b
LLaVA-Mini is an efficient multimodal large model that markedly improves image and video understanding efficiency by representing each image with a single vision token.
- License: GPL-3.0 · Task: Image-to-Text
- Author: ICTNLP · Downloads: 12.45k · Likes: 51

## Xgen Mm Vid Phi3 Mini R V1.5 128tokens 8frames
xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model equipped with an explicit temporal encoder, specifically designed for video content understanding.
- Task: Video-to-Text · Tags: Safetensors, English
- Author: Salesforce · Downloads: 398 · Likes: 11

## Mplug Owl3 7B 240728
mPLUG-Owl3 is a cutting-edge multimodal large language model designed to tackle the challenges of long image-sequence understanding, supporting single-image, multi-image, and video tasks.
- License: Apache-2.0 · Task: Text-to-Image · Tags: Safetensors, English
- Author: mPLUG · Downloads: 4,823 · Likes: 39

## Minicpm V 2 6 Int4
MiniCPM-V 2.6 is a multimodal vision-language model supporting image-to-text conversion with multilingual capabilities; this repository provides the int4-quantized build.
- Task: Image-to-Text · Tags: Transformers, Other
- Author: openbmb · Downloads: 122.58k · Likes: 79

## Llava NeXT Video 7B DPO
LLaVA-Next-Video is an open-source multimodal dialogue model, built by fine-tuning a large language model on multimodal instruction-following data and supporting mixed video-and-text interaction; this is the DPO-tuned variant.
- Task: Text-to-Video · Tags: Transformers
- Author: lmms-lab · Downloads: 8,049 · Likes: 27

## Llava NeXT Video 7B
LLaVA-Next-Video is an open-source multimodal chatbot, fine-tuned from a large language model and supporting mixed video-and-text interaction.
- Task: Text-to-Video · Tags: Transformers
- Author: lmms-lab · Downloads: 1,146 · Likes: 46

## Model Timesformer Subset 02
A video understanding model based on the TimeSformer architecture, fine-tuned on an unknown dataset with an accuracy of 88.52%. A minimal loading sketch follows below.
- Task: Video Processing · Tags: Transformers
- Author: namnh2002 · Downloads: 15 · Likes: 0

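TimeSformer classifiers load through the standard Transformers video-classification API. A minimal sketch; since the card does not name its base checkpoint, the common facebook/timesformer-base-finetuned-k400 stands in:

```python
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerForVideoClassification

# The card does not name its base checkpoint; this assumes the common
# Kinetics-400 fine-tuned TimeSformer as a representative starting point.
ckpt = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TimesformerForVideoClassification.from_pretrained(ckpt)

# This checkpoint's config samples 8 frames per clip.
video = list(np.random.randint(0, 256, (8, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
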
## MMICL Instructblip T5 Xxl
MMICL is a multimodal vision-language model built on BLIP-2/InstructBLIP, able to analyze and understand multiple images while following instructions.
- License: MIT · Task: Image-to-Text · Tags: Transformers, English
- Author: BleachNick · Downloads: 156 · Likes: 11

## Videomae Base Ipm All Videos
A vision model fine-tuned from the VideoMAE base model on an unknown video dataset, used primarily for video understanding tasks; it achieves 85.59% accuracy on its evaluation set.
- Task: Video Processing · Tags: Transformers
- Author: rickysk · Downloads: 30 · Likes: 0

## Videomae Base Finetuned
A video understanding model fine-tuned from MCG-NJU/videomae-base on an unknown dataset, achieving an F1 score of 0.7147.
- Task: Video Processing · Tags: Transformers
- Author: sheraz179 · Downloads: 15 · Likes: 0

## Videomae Base Finetuned
A video understanding model fine-tuned from the VideoMAE base model on an unknown dataset, achieving 86.41% accuracy on its evaluation set.
- Task: Video Processing · Tags: Transformers
- Author: LouisDT · Downloads: 15 · Likes: 0

## Vivit B 16x2
ViViT extends the Vision Transformer (ViT) to video processing and is used primarily for downstream tasks such as video classification. A minimal classification sketch follows below.
- License: MIT · Task: Video Processing · Tags: Transformers
- Author: google · Downloads: 989 · Likes: 11

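ViViT follows the same video-classification pattern as the other classifiers above. A minimal sketch, assuming the Kinetics-400 fine-tuned variant google/vivit-b-16x2-kinetics400, since the backbone alone carries no classification head:

```python
import numpy as np
import torch
from transformers import VivitForVideoClassification, VivitImageProcessor

# The listed repo is a pretrained backbone; classification assumes the
# Kinetics-400 fine-tuned variant.
ckpt = "google/vivit-b-16x2-kinetics400"
processor = VivitImageProcessor.from_pretrained(ckpt)
model = VivitForVideoClassification.from_pretrained(ckpt)

# ViViT's tubelet embedding consumes 32-frame clips at 224x224.
video = list(np.random.randint(0, 256, (32, 224, 224, 3), dtype=np.uint8))

inputs = processor(video, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```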